Towards a Structured Representation of Generic Concepts and Relations in Large Text Corpora
Authors
Abstract
Extraction of structured information from text corpora involves identifying entities and the relationships between them expressed in unstructured text. We propose a novel iterative pattern induction method that extracts relation tuples by exploiting the lexical and shallow syntactic patterns of a sentence. Starting from a single pattern, we illustrate how the method discovers additional patterns and tuples on its own as the amount of data increases. We apply frequency- and correlation-based filtering and ranking of relation tuples to ensure the correctness of the extracted tuples. Experimental evaluation against other state-of-the-art open extraction systems such as ReVerb, TextRunner, and WOE shows the effectiveness of the proposed system.
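As a rough illustration of the iterative idea described in the abstract, the following minimal Python sketch bootstraps relation phrases from a single seed pattern. It is not the authors' implementation: it uses only lexical "X <relation> Y" patterns (no shallow syntactic features or correlation-based ranking), and the function names `extract`, `induce`, and `bootstrap`, the `min_freq` threshold, and the toy corpus are all assumptions made for this example.

```python
import re
from collections import Counter

def extract(sentences, patterns):
    """Return (arg1, relation, arg2) tuples for sentences of the form 'X <relation> Y'."""
    tuples = []
    for s in sentences:
        for rel in patterns:
            m = re.fullmatch(rf"(\w+) {re.escape(rel)} (\w+)", s)
            if m:
                tuples.append((m.group(1), rel, m.group(2)))
    return tuples

def induce(sentences, tuples):
    """Count the phrases that connect argument pairs already seen in extracted tuples."""
    pairs = {(a, b) for a, _, b in tuples}
    counts = Counter()
    for s in sentences:
        for a, b in pairs:
            m = re.fullmatch(rf"{re.escape(a)} (.+) {re.escape(b)}", s)
            if m:
                counts[m.group(1)] += 1
    return counts

def bootstrap(sentences, seed, min_freq=2, max_iter=5):
    """Grow the pattern set from one seed; keep only phrases seen at least min_freq times."""
    patterns = {seed}
    for _ in range(max_iter):
        tuples = extract(sentences, patterns)
        counts = induce(sentences, tuples)
        new = {rel for rel, c in counts.items() if c >= min_freq} - patterns
        if not new:
            break  # no new pattern passed the frequency filter
        patterns |= new
    return patterns, set(extract(sentences, patterns))

# Hypothetical toy corpus, invented for this sketch.
sentences = [
    "Paris is in France",
    "Berlin is in Germany",
    "Paris lies in France",
    "Berlin lies in Germany",
    "Rome lies in Italy",
    "Paris loves France",
]
patterns, tuples = bootstrap(sentences, seed="is in")
print(sorted(patterns))  # → ['is in', 'lies in']
```

In this toy run the seed "is in" yields two tuples, whose argument pairs then reveal "lies in" as a second pattern; "loves" occurs only once with a known pair and is dropped by the frequency filter, while the new pattern in turn extracts the additional tuple for Rome and Italy.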
Similar resources
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP), intensified by the growing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...
Ontology-Based Document Clustering with a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
A Generic Analysis of the conclusion section of Research Articles in the field of sociology: A Comparative study
This paper reported on a genre-driven comparative study, which aimed to identify the generic moves in the conclusion sections of twenty research articles in the field of sociology written in the two codes of Persian and English. To meet this purpose, the researchers employed Moritz, Meurer, and Dellagnelo's model, which was set within the Swalesian framework of genre analysis. The analysis was ...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Towards Unrestricted, Large-Scale Acquisition of Feature-Based Conceptual Representations from Corpus Data
In recent years a number of methods have been proposed for the automatic acquisition of feature-based conceptual representations from text corpora. Such methods could offer valuable support for theoretical research on conceptual representation. However, existing methods do not target the full range of concept-relation-feature triples occurring in human-generated norms (e.g. flute produce sound) ...